Fixes continuation row grouping and adds strictness #10
Merged
Resolves critical issues where continuation rows (rows with empty identifier columns) were not properly grouped in CsvStreamParser, causing discrepancies between streaming and non-streaming parsing results. Changes CsvStreamParser to buffer continuation groups by default, matching CsvParser behavior. Adds a maxContinuationGroupSize guard (default: 10000) to prevent unbounded memory growth when identifier values are missing. Improves identifierColumn validation to throw early when the configured column does not exist after filtering or transformation; when headerTransformer or columnMapping is used, identifierColumn must be given as the transformed column name. Fixes nested array handling in JsonToCsv so that child continuation values are emitted under their parent array items rather than at root level. Enhances documentation to clarify that parseStream() buffers the full content in memory, while CsvStreamParser provides true incremental processing with continuation grouping. Updates dependencies and fixes quote character handling in splitLines so that escaped quotes are tracked properly.
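The grouping and guard behavior described above can be illustrated with a small standalone sketch. This is not the library's implementation: the ContinuationGrouper class, its push/flush methods, and the error messages are hypothetical names chosen for illustration, and plain Error stands in for the library's CsvParseError. It also assumes that an identifier counts as empty only when it is exactly the empty string, and that the size limit counts every row in a group, including the row that started it.

```typescript
// Hypothetical sketch of continuation-row grouping; not the library's code.
type Row = Record<string, string>;

class ContinuationGrouper {
  private group: Row[] = [];
  private identifierColumn: string;
  private maxGroupSize: number;

  constructor(identifierColumn: string, maxGroupSize: number = 10000) {
    this.identifierColumn = identifierColumn;
    this.maxGroupSize = maxGroupSize;
  }

  // Feed one row. Returns the previously buffered group when this row
  // starts a new one, otherwise null.
  push(row: Row): Row[] | null {
    if (!(this.identifierColumn in row)) {
      // Fail early instead of silently treating every row as a continuation.
      throw new Error('Identifier column "' + this.identifierColumn + '" not found');
    }
    if (row[this.identifierColumn] !== "") {
      // A non-empty identifier closes the buffered group and opens a new one.
      const finished = this.group;
      this.group = [row];
      return finished.length > 0 ? finished : null;
    }
    if (this.group.length === 0) {
      throw new Error("A continuation row cannot start a group");
    }
    if (this.group.length >= this.maxGroupSize) {
      // Guard against unbounded buffering when identifiers stay empty.
      throw new Error("Continuation group exceeds " + this.maxGroupSize + " rows");
    }
    this.group.push(row);
    return null;
  }

  // Call at end of input to emit the last buffered group, if any.
  flush(): Row[] | null {
    const finished = this.group;
    this.group = [];
    return finished.length > 0 ? finished : null;
  }
}
```

Because only one group is buffered at a time, memory use in this sketch is bounded by the largest group rather than by the whole input, which is the property the size guard is meant to protect.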
ivo-rws
approved these changes
Apr 9, 2026
IvoCerios
approved these changes
Apr 9, 2026
This pull request changes the CSV parsing library to improve consistency between batch and streaming parsing, enforce stricter validation, and enhance memory safety. The main changes include aligning the grouping behavior of CsvStreamParser with CsvParser, enforcing strict validation of the identifierColumn, clarifying documentation, and adding a memory safeguard for continuation groups. Additionally, several development dependencies have been updated.

CSV Parsing Consistency and Validation:
- CsvStreamParser now always emits nested grouped output, matching CsvParser continuation-row semantics. The previous nested option is removed to avoid divergence and ensure consistent grouping of continuation rows in both batch and streaming APIs.
- Stricter identifierColumn validation: if the configured identifier column is missing from headers, parsing throws a CsvParseError instead of continuing ambiguously. Additionally, a continuation row cannot start a group; if the first data row has an empty identifier, parsing throws CsvParseError.

Streaming Parser Improvements:
- Added a maxContinuationGroupSize option (default: 10,000) to CsvStreamParser to prevent unbounded memory usage when identifier values are missing for long stretches. Exceeding this limit throws a CsvParseError.

Documentation Updates:
- Clarified the difference between CsvParser.parseStream() (buffers entire stream in memory) and CsvStreamParser (true streaming, memory efficient, always groups continuation rows). Also clarified options such as includeColumns, excludeColumns, null handling, and the new maxContinuationGroupSize safeguard.

Bug Fixes and Internal Improvements:
- Fixed CsvReader.splitLines() to handle custom quote characters and escaped quotes correctly.

These changes make the CSV parsing behavior more predictable, robust, and safe for large-scale streaming workloads.
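To make the splitLines fix concrete, here is a minimal sketch of quote-aware line splitting. It is not the code from CsvReader.splitLines(): the function name splitLinesQuoteAware is hypothetical, and the sketch assumes a single-character quote and the common convention that a quote inside a quoted field is escaped by doubling it.

```typescript
// Hypothetical sketch of quote-aware line splitting; not the library's code.
// Assumes a single-character quote, escaped inside a field by doubling it.
function splitLinesQuoteAware(text: string, quote: string = '"'): string[] {
  const lines: string[] = [];
  let current = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === quote) {
      if (inQuotes && text[i + 1] === quote) {
        // Doubled quote inside a quoted field: keep both, stay in the field.
        current += ch + quote;
        i++;
      } else {
        // Opening or closing quote.
        inQuotes = !inQuotes;
        current += ch;
      }
    } else if (!inQuotes && (ch === "\n" || ch === "\r")) {
      // A line break outside quotes ends the record; treat \r\n as one break.
      if (ch === "\r" && text[i + 1] === "\n") i++;
      lines.push(current);
      current = "";
    } else {
      // Ordinary character, or a line break inside a quoted field.
      current += ch;
    }
  }
  if (current !== "") lines.push(current);
  return lines;
}
```

Passing the quote character as a parameter, rather than hard-coding the double quote, is what lets the same state machine handle a custom quote such as a single quote.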